A simple analysis of how Cabify's customers are using the service.
import numpy as np
import pandas as pd
import seaborn as sns
import datetime
import os
sns.set()
import colorlover as cl
import copy
from IPython.display import Image
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly as py
import plotly.graph_objs as go
import plotly.figure_factory as ff
init_notebook_mode(connected=True)
loc = os.getcwd()
df_prueba_selec = pd.read_csv(loc+'/prueba_seleccion.csv',dtype='str')
df_prueba_selec.fillna(method='ffill', inplace=True) # rather than just removing the NaN-values, let's keep them;
# filling out the NaN-values with the previous value. (Failing in
# doing so will remove almost the entire dataset.)
pd.set_option('display.max_columns', 100)
print("Total amount of rows: {0}".format(df_prueba_selec.shape[0]))
print("Total amount of columns: {0}".format(df_prueba_selec.shape[1]))
df_prueba_selec.head(5)
pd.set_option('display.max_rows', 100)
# Load the price, cost and distance for each trip
price = df_prueba_selec.loc[:, "price"].str.replace(',', '')
cost = df_prueba_selec.loc[:, "cost"].str.replace(',', '')
distance = df_prueba_selec.loc[:, "distance"].str.replace(',', '')
# Remove NaN values
price.fillna(0, inplace=True)
cost.fillna(0, inplace=True)
distance.fillna(0, inplace=True)
distance = pd.DataFrame(distance.astype("int"), columns=["distance"])
# Calculate the profit for each ride. Divide with 100 to get it in €.
profit = pd.DataFrame((price.astype("int")-cost.astype("int"))/100, columns=['profit'])
# Add rider_id to the Dataframe
riders = pd.DataFrame(df_prueba_selec.loc[:, "rider_id"])
cleaned_data = pd.concat([riders, profit, distance], axis=1)
riders = pd.value_counts(df_prueba_selec.loc[:, "rider_id"])
data1 = go.Bar(x=riders[:226].index,
y=riders[:226].values,
marker=dict(color='rgba(60, 179, 113,1)'),
name="Top 25%")
data2 = go.Bar(x=riders[226:-226].index,
y=riders[226:-226].values,
marker=dict(color='rgba(181,126,220,1)'),
name="Happy Medium")
data3 = go.Bar(x=riders[-226:].index,
y=riders[-226:].values,
marker=dict(color='rgba(255, 165, 0, 1)'),
name="Bottom 25%")
layout = go.Layout(
title='Number of trips per rider. <br>(Hover for rider ID. Pan and Zoom using the tools above to the right)',
yaxis=dict(title='Number of Trips'),
xaxis=dict(title = 'Rider ID', showticklabels=False),
autosize=False,
width=900,
height=500,
margin=go.layout.Margin(l=120, r=120, b=100, t=100, pad=4))
fig = go.Figure(data=[data1, data2, data3], layout=layout)
py.offline.iplot(fig)
# Calculate some statistics
total_rides = riders.values.sum()
top_25 = riders.values[:226].sum() # Sum the total rides for the top 25%
bottom_25 = riders.values[-226:].sum() # Sum the total rides for the bottom 25%
print("Top 25% of the customers stands for {0:0.0f}% of the total rides.".format((top_25/total_rides)*100))
print("Bottom 25% of the customers stands for {0:0.1f}% of the total rides.".format((bottom_25/total_rides)*100))
Some few riders make a huge amount of total rides. As a matter of fact, 25% of the top riders make 74% of all the rides. It would be great for Cabify to build some special customer loyalty to these top 25% in order to keep them using the service.
On the other hand, focusing the effort on re-attracting the bottom 25% to start using the service again or to use it more through discounts or "free rides" might be a good way to increse these customers' use frequency. As of today, they only contribute with 1.2% of the total rides. They already know about the Cabify brand, thus, it might be easier/cheaper to convince them to use the service again, or with a higher frequency, than convincing completely new customers.
It might be interesting to know how long the longest rides have been.
# Calculate the longest trips.
longest_trips = copy.deepcopy(cleaned_data)
longest_trips.sort_values(by=["distance"], inplace=True, ascending=False)
data1 = go.Bar(x=longest_trips.iloc[:47, 0].values,
y=longest_trips.iloc[:47, 2].values/1000,
marker=dict(color='rgba(60, 179, 113,1)'),
name="Above 50km") # up to position 47 included rides longer than 50km
data2 = go.Bar(x=longest_trips.iloc[47:1000, 0].values,
y=longest_trips.iloc[47:1000, 2].values/1000,
marker=dict(color='rgba(181,126,220,1)'),
name="Below 50km")
layout = go.Layout(
title='Top 1000 longest trips (out of 52,645 total trips)',
yaxis=dict(title='Distance (km)'),
xaxis=dict(title = 'Rider Id', showticklabels=False),
autosize=False,
width=1000,
height=500,
margin=go.layout.Margin(l=120, r=120, b=100, t=100, pad=4),
legend=dict(x=0.85, y=.93))
fig = go.Figure(data=[data1, data2], layout=layout)
py.offline.iplot(fig)
From the plot above, we can see that there are actually only two trips that are longer than 100km. One is as long as 614km! The vast majority are shorter than 50km though. Note that many of the riders have made several trips (resulting in both green and purple piles to the left).
# Calculate the profit per rider and display the most profitable ones
cleaned_data_per_rider = cleaned_data.groupby(['rider_id'])['profit', 'distance'].sum()
cleaned_data_per_rider.sort_values(by=["profit"], inplace=True, ascending=False)
data1 = go.Bar(x=cleaned_data_per_rider.iloc[:181].index,
y=cleaned_data_per_rider.iloc[:181, 0].values,
marker=dict(color='rgba(60, 179, 113, 1)'),
name="Top 20% most profitable riders")
data2 = go.Bar(x=cleaned_data_per_rider.iloc[181:-181].index,
y=cleaned_data_per_rider.iloc[181:-181, 0].values,
marker=dict(color='rgba(181,126,220,1)'),
name="Potentially top 20%")
data3 = go.Bar(x=cleaned_data_per_rider.iloc[-181:].index,
y=cleaned_data_per_rider.iloc[-181:, 0].values,
marker=dict(color='rgba(255, 165, 0, 1)'),
name="Bottom 20%")
layout = go.Layout(
title='Riders and profitability',
yaxis=dict(title='Profits'),
xaxis=dict(title = 'Rider (Pan for ID)', showticklabels=False),
width=1000,
height=600,
margin=go.layout.Margin(l=120, r=120, b=100, t=100, pad=5),
legend=dict(x=0.7, y=.95))
# Plot it
fig = go.Figure(data=[data1, data2, data3], layout=layout)
py.offline.iplot(fig)
# Calculate some statistics
total_profit = cleaned_data_per_rider.loc[:, "profit"].sum()
top_20 = cleaned_data_per_rider.iloc[:181, 0].sum() # Top 20%
bottom_20 = cleaned_data_per_rider.iloc[-181:, 0].sum() # Bottom 20%
print("Top 20% of the customers stands for {0:0.0f}% of the total profit.".format((top_20/total_profit)*100))
print("Bottom 20% of the customers stands for {0:0.1f}% of the total profit.".format((bottom_20/total_profit)*100))
Assuming that "cost" is the total cost for Cabify to provide the service to the customer and "price" is the total price the customer pays for this service, above figure displays the profitability for each rider with the top 20 percent displayed in green (pan the figure for more info on each rider). Profit is calculated as the difference between price and cost.
As Cabify wants to keep its most profitable customers, it migth be a good idea to offer them special programs, discounts or other advantages to keep them as customers. Targeting these 20% might thus be a good idea. Interesting to note is that 75% of the profit originates from only 20% of the customers. This is in close accordance with the pareto principle, or "80/20 rule".
Subsequently, the bottom 20% might instead be targeted with campaigns with the objective to increase their profitability. As of now, they only contribute with 0.4% of the total profit. Low profitable customers are surely better than no customers at all, but Cabify would do good in understanding why these customers aren't more profitable. Maybe they had a bad first experience, maybe they can't find a ride when they need it, the price, or maybe there are other reasons? Understanding these customers' needs might make them more profitable.
rest = copy.deepcopy(cleaned_data_per_rider)
# Get the top riders in terms of profit and distance and remove the equivalent data from the original dataframe
top_profit = rest[rest.loc[:, "profit"] >= 1000]
rest.drop(rest[rest.profit >= 1000].index, inplace=True)
top_distance = rest[rest.loc[:, "distance"] >= 4000000]
rest.drop(rest[rest.distance >= 4000000].index, inplace=True)
# Concatenate the top riders
special_treatment = pd.concat([top_profit, top_distance], axis=0)
x1 = rest.loc[:, "distance"].values/1000
y1 = rest.loc[:, "profit"].values
x2 = special_treatment.loc[:, "distance"].values/1000
y2 = special_treatment.loc[:, "profit"].values
data1 = go.Scatter(x = x1,
y = y1,
mode = 'markers',
marker=dict(color='rgba(181,126,220,1)'),
name="Great clients")
data2 = go.Scatter(x = x2,
y = y2,
mode = 'markers',
marker=dict(color='rgba(60, 179, 113, 1)'),
name="Excellent clients")
layout = go.Layout(title='Profit vs Distance',
yaxis=dict(title='Profit (€)'),
xaxis=dict(title = 'Distance (Km)'),
autosize=False,
width=900,
height=600,
legend=dict(x=0.8, y=.95))
fig = go.Figure(data=[data1, data2], layout=layout)
py.offline.iplot(fig)
Above figure shows us a linear relationship (a part from some exceptions) between the total distance each client has made with Cabify and their profitability. This means that the more distance a client has made, the more profitable they are as well. One way to reward clients might be to reward those that have made a total distance of more than 4000 km or earned a profit for Cabify of more than 1000 € (both displayed in green). Some clients are highly profitable (see top left of the graph) as they have contributed to high profits during relatively short distances. These are Cabify's "golden" customers - maintaining them is thus a priority!
# Find out HOW people make their Cabify reservations
source = pd.value_counts(df_prueba_selec.loc[:, "source"])
data = [go.Bar(x=source.index,
y=source.values,
marker=dict(
color='rgba(181,126,220,1)'))]
layout = go.Layout(title='How do riders book their CABify?',
yaxis=dict(title='Amount'),
xaxis=dict(title = 'Source'),
width=700,
height=400)
fig = go.Figure(data=data, layout=layout)
py.offline.iplot(fig)
Combining above information, where we see that Android and iPhone are the most used sources for booking a ride, with the statistics on this site, it might be an interesting idea to cooperate with Huawei and Xiaomi (Android) as their market share in Spain is increasing while Samsung's is decreasing. A close cooperation with them might allow more potential customers to learn about Cabify!
Let's start out by finding out which region is the most frequent one for Cabify rides:
region = pd.value_counts(df_prueba_selec.loc[:, "region_id"])
data = [go.Bar(x=region.index,
y=region.values,
marker=dict(
color='rgba(181,126,220,1)'))]
layout = go.Layout(title='Which region is the most frequent one?',
yaxis=dict(title='Frequency'),
xaxis=dict(title = 'Region'),
width=700,
height=400)
fig = go.Figure(data=data, layout=layout)
py.offline.iplot(fig)
As Madrid is the by far most frequent region (around 80% of the total rides), we will focus our attention here.
It migth also be interesting to know which days are the most frequent ones. First we have to add the corresponding day for each date to the original data file. This is done with the datetime module in python.
# Use the datetime module to find out which day each date corresponds to in the dataset.
days = []
for i in range(len(df_prueba_selec)):
day = datetime.datetime.strptime(str(df_prueba_selec.iloc[i, 22]), '%Y-%m-%d %H:%M:%S').strftime('%A')
days.append(day)
# Add the corresponding days to the dataframe
df_prueba_selec['weekday'] = pd.DataFrame(days, index=df_prueba_selec.index)
df_prueba_selec.to_csv(loc+'/prueba_seleccion_added.csv') # store it locally
(df_prueba_selec).head(1)
df_prueba_selec = df_prueba_selec.loc[df_prueba_selec["region_id"] == "madrid"] # keep the Madrid region only
day = pd.value_counts(df_prueba_selec.loc[:, "weekday"])
day = day.loc[["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]] # custom sort
data1 = [go.Bar(x=day.index,
y=day.values,
marker=dict(
color='rgba(181,126,220,1)'))]
layout2 = go.Layout(title='Which is the most frequent Cabify day in the Madrid area?',
yaxis=dict(title='Frequency'),
xaxis=dict(title = 'Day'),
width=700,
height=400)
fig1 = go.Figure(data=data1, layout=layout2)
py.offline.iplot(fig1)
# *****************************
start_city = pd.value_counts(df_prueba_selec.loc[:, "start_city"])
data2 = [go.Bar(x=start_city.iloc[:10].index,
y=start_city.iloc[:10].values,
marker=dict(
color='rgba(181,126,220,1)'))]
layout2 = go.Layout(title='And where? (10 most frequent)',
yaxis=dict(title='Frequency'),
xaxis=dict(title = 'Region'),
width=700,
height=400,
margin=go.layout.Margin(l=50, r=75, b=100, t=50, pad=5))
fig2 = go.Figure(data=data2, layout=layout2)
py.offline.iplot(fig2)
print()
print()
print("Total number of rides in the Madrid area: {0}".format(len(df_prueba_selec)))
As we can se from the above plot, Thursday offers the most rides during the week. Surprisingly, the number of rides during the weekends are just a fraction of that of a weekday.
The majority of the rides start in Madrid (68%), with a significant amount starting in Boadilla del Monte as well (26%) (see above). The below picture displays all the start locations in the Madrid area, plotted in Tableau. Follow this link for access to the workbook. We can chose to focus our attention on several areas, but for the sake of this exercise let us concentrate on two main areas; Boadilla del Monte and around Nuevos Ministerios, both displayed in green. Let's see at what time Cabify should focus their fleet on these areas!
Image(filename = loc + "/Madrid_locations.png")
B_del_Monte = copy.deepcopy(df_prueba_selec.loc[df_prueba_selec["start_city"] == "Boadilla del Monte"])
# Convert date and time to a single hour.
hours = []
for i in range(len(B_del_Monte)):
hour = datetime.datetime.strptime(str(B_del_Monte.iloc[i, 22]), '%Y-%m-%d %H:%M:%S').strftime('%H') # get hour
hours.append(hour)
# Add the corresponding hour to the dataframe
B_del_Monte['hour'] = pd.DataFrame(hours, index=B_del_Monte.index)
hours = pd.value_counts(B_del_Monte.loc[:, "hour"])
data = go.Bar(x = hours.index,
y = hours.values,
marker=dict(color='rgba(181,126,220,1)'))
layout = go.Layout(title='Boadilla del Monte; Occurrences per hour.',
yaxis=dict(title='#Occurrences'),
xaxis=dict(title = 'Hour'),
autosize=False,
width=900,
height=600)
fig = go.Figure(data=[data], layout=layout)
py.offline.iplot(fig)
As we can see from the above plot, there's a big peak in rider activity between around 17.00 and 20.00. Increasing the number of CABifys during this time in Boadilla de Monte would thus be a good idea. Further investigations can be made to learn exactly which weekdays are the most frequent ones, but as we saw earlier, Monday to Friday are the by far most frequent ones. These hours also makes sense since it's during this time people wants to get back home after finishing work.
As the dataset doesn't provide us with a tag "Nuevos Ministerios", we have to find other ways to select the data of interest. One way to do this is to display all the datapoints in Tableau, select those around Nuevos Ministerios and save the file before processing it the same way as we did for Boadilla de Monte. Let's do that! The corresponding Tableau workbook can be reached through this link.
# Read the file obtained from the Tableau workbook.
loc = os.getcwd()
df_Nuevos_Ministerios = pd.read_csv(loc+'/Nuevos_Ministerios.csv', delimiter=";")
df_Nuevos_Ministerios.fillna(method='ffill', inplace=True)
print("Total rides from Nuevos Ministerios: {0}".format(len(df_Nuevos_Ministerios)))
df_Nuevos_Ministerios.head(1)
hours1 = pd.value_counts(df_Nuevos_Ministerios.loc[:, "Hour of Pick Up At"])
data = go.Bar(x = hours1.index,
y = hours1.values,
marker=dict(color='rgba(181,126,220,1)'))
layout = go.Layout(title='Nuevos Ministerios; Occurrences per hour.',
yaxis=dict(title='#Occurrences'),
xaxis=dict(title = 'Hour'),
autosize=False,
width=900,
height=600)
fig = go.Figure(data=[data], layout=layout)
py.offline.iplot(fig)
The rides from Nuevos Ministerios are more evenly spread across the day. However, there is a peak during the morning rush hours between 06.00 and 09.00. Increasing the fleet size in this area at this time would thus be a good idea. Keeping roughly half of the fleet until 22.00 to cover the demand would further be advisable.